Automated Discovery of Internet Censorship by Web Crawling
Authors
Abstract
Censorship of the Internet is widespread around the world. As access to the web becomes increasingly ubiquitous, filtering of this resource becomes more pervasive. Transparency about the specific content that citizens are denied access to is atypical. To counter this, numerous techniques for maintaining URL filter lists have been proposed by individuals and organisations that aim to provide empirical data on censorship for the benefit of the public and the wider censorship research community. We present a new approach for discovering filtered domains in different countries. This method is fully automated and requires no human interaction. The system uses web crawling techniques to traverse between filtered sites and implements a robust method for determining whether a domain is filtered. We demonstrate the effectiveness of the approach by running experiments to search for filtered content in four different censorship regimes. Our results show that we perform better than the current state of the art and have built domain filter lists an order of magnitude larger than the most widely available public lists as of Jan 2018. Further, we build a dataset mapping the interlinking nature of blocked content between domains and exhibit the tightly networked nature of censored web resources.
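The core idea above — crawling outward from known filtered sites and probing each newly discovered domain — can be sketched as a breadth-first traversal over a link graph. This is an illustrative sketch only: the link graph is hard-coded, and `is_filtered` is a hypothetical stand-in for the paper's actual block-detection method (which in practice would compare responses fetched from inside and outside the censored network).

```python
from collections import deque

# Toy outbound-link graph standing in for links scraped from crawled pages.
LINKS = {
    "seed.example": ["a.example", "b.example"],
    "a.example":    ["c.example"],
    "b.example":    ["a.example", "d.example"],
    "c.example":    [],
    "d.example":    ["c.example"],
}

# Hypothetical ground truth used only to make this sketch self-contained.
BLOCKED = {"a.example", "c.example"}

def is_filtered(domain):
    """Stand-in filter probe; a real system would test network reachability."""
    return domain in BLOCKED

def discover_filtered(seed):
    """BFS from a seed domain, recording every domain that tests as filtered."""
    seen, queue, filtered = {seed}, deque([seed]), set()
    while queue:
        domain = queue.popleft()
        if is_filtered(domain):
            filtered.add(domain)
        for nxt in LINKS.get(domain, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return filtered
```

Starting the traversal from known-blocked seeds exploits the tightly networked nature of censored content that the paper reports: blocked sites tend to link to other blocked sites.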
Similar Resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages. This unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
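The URL-queue prioritization that focused crawlers rely on can be sketched with a priority queue that always yields the most topically relevant URL first. The relevance scoring here (keyword overlap with anchor text) and all names are illustrative assumptions, not the cited paper's method:

```python
import heapq

# Hypothetical topic vocabulary for the focused crawl.
TOPIC_KEYWORDS = {"censorship", "filtering", "blocklist"}

def relevance(anchor_text):
    """Fraction of topic keywords appearing in the link's anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

class FocusedFrontier:
    """URL queue ordered so the most topically relevant URL is popped first."""
    def __init__(self):
        self._heap, self._count = [], 0

    def push(self, url, anchor_text):
        # heapq is a min-heap, so negate the score; the counter breaks ties
        # and keeps insertion order stable for equally scored URLs.
        heapq.heappush(self._heap, (-relevance(anchor_text), self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

frontier = FocusedFrontier()
frontier.push("http://x.example/cats", "funny cats")
frontier.push("http://x.example/block", "censorship filtering news")
frontier.push("http://x.example/list", "national blocklist")
```

Popping the frontier now returns the two on-topic URLs before the irrelevant one, which is the behaviour a focused crawler wants from its queue ordering.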
Finding and Emulating Keyboard, Mouse, and Touch Interactions and Gestures while Crawling RIA's
Crawling JavaScript-heavy Rich Internet Applications has been a hot topic in recent years, giving us automated tools for indexing content, test generation, and security and accessibility evaluation, to mention a few examples. However, existing crawling techniques tend to ignore user interactions beyond mouse clicking, and therefore often fail to consider potential mouse, keyboard and touch intera...
Ontology-Learning-Based Focused Crawling for Online Service Advertising Information Discovery and Classification
Online advertising has become increasingly popular among SMEs in service industries, and thousands of service advertisements are published on the Internet every day. However, there is a huge barrier between service-provider-oriented service information publishing and service-customer-oriented service information discovery, which means that service consumers can hardly retrieve the published servic...
Usage of Dedicated Data Structures for URL Databases in a Large-Scale Crawling
Since the beginning of the Internet there has been a need to browse its resources automatically for many purposes: indexing, cataloguing, validating, monitoring, etc. Because of the large volume of today's World Wide Web, the term Internet is often wrongly identified with the single HTTP protocol service. The process of browsing the World Wide Web in an automated manner is called crawlin...
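One basic dedicated structure every crawler's URL database needs is a seen-URL store that deduplicates the frontier. A minimal sketch, assuming simple canonicalization rules (lowercased host, trailing slash stripped); large-scale crawlers typically swap the in-memory set for a disk-backed or probabilistic structure such as a Bloom filter:

```python
from urllib.parse import urlsplit

def canonical(url):
    """Normalize a URL so trivially different spellings map to one key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), host, path)

class SeenURLs:
    """Exact-membership store for the crawler's URL-seen test."""
    def __init__(self):
        self._seen = set()

    def add(self, url):
        """Record a URL; return True if it was new, False if already stored."""
        key = canonical(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

The canonicalization step matters at scale: without it, the same page reached via differently spelled URLs would be fetched repeatedly, inflating both the database and the crawl time.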
Web Crawler: A Review
Information Retrieval deals with searching for and retrieving information within documents, including online databases and the Internet. A web crawler is defined as a program or piece of software that traverses the Web and downloads web documents in a methodical, automated manner. Based on the type of knowledge used, web crawlers are usually divided into three types of crawling techniques: General Purpo...